Executive Summary

data <- data %>% mutate( Local_DateTime = dmy_hm(Local Time), Month = month(Local_DateTime), WeekdayNum = wday(Local_DateTime), Year = year(Local_DateTime), Date = as.Date(Local_DateTime), Hour = hour(Local_DateTime), # Extract hour winter_year = ifelse(Month >= 11, Year, Year - 1), winter_start = as.Date(paste0(winter_year, “-11-01”)), DSN = as.numeric(Date - winter_start) )

Add this at the end once we have our results

Method and Approach

The main aim of this report was to enable NESO to estimate demand patterns in electricity over the long-term. To do this we considered several ordinary least squares (OLS) regression model each fitted to assess the association between a set of predictor variables and average daily demand. Our predictor variables included both temporal and weather based variables. We then chose a model by comparing important scores such as ____ (which ______). We further adapted it to ensure practicality and simplicity of the model. Finally we analyse the predictive ability of our model by applying cross validation to ensure the model has strong predictive ability and to ensure its not over fitting to the data set.

Data and Exploratory Analysis

Prior to fitting any models we examine the dataset used in our analysis. We consider the following predictor variables for our linear model:

We next examine the data visually to highlight any general trends and detect any anomalies. We use scatter plots below to examine how each predictor correlates with demand.

<<<<<<< HEAD
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

=======

>>>>>>> origin/main

Key observations from the analysis indicate that wind has no clear correlation with average demand and was therefore removed from the model as it provides no predictive value. Most the solar data are concentrated between 0 and 0.05, with higher solar generation associated with lower average demand, suggesting an inverse relationship. This aligns with expectations, as increased sunlight reduces the need for lighting, lowering electricity demand. Temperature, TO and TE all show a negative linear correlation with average demand, suggesting that demand tends to decrease as temperature rises. This is intuitive since as warmer conditions generally reduce heating needs. TE shows the strongest negative effect and is considered the best temperature measure because it incorporates information from both today and the previous day. Its importance lies in averaging with the past day, capturing a lagged effect that may reflect customers updating heating settings with a delay. The days since november variable follows an overall quadratic distribution, but has a pronounced dip in December which suggests that an interaction between month and DSN should be considered in the model. Month has a clear effect on demand, with the highest levels observes in January and February and the lowest in November and March. Finally, day of the week is an important predictor to include in the model as demand is lower on weekends, highest on weekdays, with Monday through Thursday showing similar mean demand and a slight decline on Fridays. Combining weekdays and weekends reduce the resolution of these important differences. Year also has an impact on demand and will be included as a factor variable. However, this will be stripped out of the model by NESO, who will add their own year effect for future years.

Model fitting and cross-validation results

Metrics & Linear Models

We aim to fit a ordinary least-squares linear regression model of the form \[\mathbf{y} = X \boldsymbol{\beta} + \boldsymbol{\epsilon}.\] In this type of model we assume the following

  • Linearity of the model parameters.
  • The errors, \(\boldsymbol{\epsilon}\), are normally distributed.
  • The errors are independent.
  • The errors have constant variance (homoscedasticity).

Hypothesis Testing

To check whether a variable is significant in the model or not, we use the following hypothesis test about the model parameters. \[\begin{eqnarray*} H_0&:& \beta_i = 0 \\ H_1&:& \beta_i \neq 0. \end{eqnarray*}\] using the test statistic \(T = \frac{\hat{\beta}_i - \beta_0}{\hat{\sigma}_{\hat{\beta}_i}} \sim t_{n-p}\). The resulting p-value of the test gives the probability of observing a value as or more extreme than the observed test statistic given that \(H_0\) is true. The model summary output gives the p-value of these tests for each of the model parameters. A high p-value suggests that we cannot reject the null hypothesis that \(H_0\) is true, that is that the associated parameter in the model is equal to 0, and the variable has no statistically significant effect in the model.

Adjusted R-squared

The \(R^2\) metric measures the proportion of variance explained by the linear model. Formally it is defined as \[\begin{equation*} r^2 = 1 - \frac{\sum_{i}^{n}(y_i - \hat{\mu}_i)^2}{\sum_{i}^{n}(y_i - \overline{y})^2} \end{equation*}\] where \(\mu_i\) is the estimated value of \(y_i\) by the model and \(\bar{y}\) is the sample mean of the data. The \(R^2\) tends to overestimate how well a model is doing, and so to account for this we look at the adjusted \(R^2\) metric which accounts for the number of parameters in the model. \[\begin{equation*} r^2_{\text{adj}} = 1 - \frac{\frac{\sum_{i}^{n}(y_i - \hat{\mu}_i)^2}{(n-p)}} {\frac{\sum_{i}^{n}(y_i - \overline{y})^2}{(n-1)}} \end{equation*}\] Models with a higher adjusted \(R^2\) are preferable over models with a lower one, as this indicates that the model explains a higher proportion of the variability within the data.

Akaike Information Criterion (AIC)

The AIC is a metric used to compare the quality of a model for a given dataset. It balances model fit with model complexity by penalising the number of parameters included in the model. Formally, AIC is defined as \[\begin{equation*} AIC=2p-2\ln(L) \end{equation*}\] where \(p\) is the number of parameters in the model and \(L\) is the maximised value of the likelihood function.

Lower AIC indicate a better balance between fit and model complexity. When comparing models, the model with the lowest AIC is generally preferred, as it explains the data well without including unnecessary parameters.

Model selection

Based on the exploratory data analysis, the final model for predicting average demand was selected to capture the most important predictors and accounting for non-linearities and interactions <<<<<<< HEAD identified in the data while balancing simplicity. The final mode is:

….. (unsure how to write final model)

Table here

======= identified in the data while balancing simplicity. The final model is:

not sure if this is the best way to format it but thought I would add it in for now…

\[\begin{eqnarray*} M_F: y_i = \beta_0 + \beta_1 \text{solar}_i + \beta_2 \text{wdaynumber}_i + \beta_3 \text{month}_i + \beta_4 \text{year}_i + \beta_5 \text{DSN}_i + \beta_6 \text{TE}_i + \beta_7 \, \text{DSN}^2\text{:month} + \epsilon_i \end{eqnarray*}\] where \(\epsilon_i \overset{\mathrm{iid}}\sim \mathcal{N}(0, \sigma^2)\).

>>>>>>> origin/main

The TE variable was selected as the temperature predictor variable because it incorporates information from both the current and previous day, capturing potential lagged effects in demand. This choice was supported quantitatively. When comparing modes that differed only in the temperature variable, TE produced a higher \(R^2\), lower RMSE and lower AIC indicating a better overall model fit and predictive performance.

<<<<<<< HEAD ======= >>>>>>> origin/main
Model Performance Comparison
Model \(R^2\) Adjusted \(R^2\) RMSE AIC
Model with temp 0.8585733 0.8569565 1533.105 62062.24
Model with TO 0.8478952 0.8461564 1589.928 62319.91
Model with TE 0.8696681 0.8681782 1471.741 61773.03

Solar generation was included because higher solar output was associated with lower average demand and was statistically significant in the final model, \(p<0.01\). Day of the week effects were modelled as a factor to account lower weekend demand and small variations across weekdays with all factor levels highly significant, \(p<0.01\). The variable DSN (days since November) exhibited a quadratic relation with demand in the earlier analysis so \(DSN^2\) was included to capture this non-linear trend. Month and Year were included as factor variables to account for seasonal and annual variation in demand. A pronounced dip in December motivated the inclusion of an interaction between DSN^2 and Month, allowing the model to account for monthly variations in the quadratic effect and substantially improving the AIC (decrease of 1,846.95).

While the inclusion of all two-way interactions followed by backwards selection yielded a slightly higher adjusted \(R^2\), this model was chosen to favour simplicity, as the additional interaction terms contributed only marginal effects.

<<<<<<< HEAD =======
Table 1: Regression Results
Estimate Std. Error t value Pr(>|t|)
(Intercept) 33526.962 321.157 104.394 0.000
TE -529.501 10.654 -49.700 0.000
solar_sarah -15846.613 1141.001 -13.888 0.000
factor(WeekdayNum)1 5596.396 93.053 60.142 0.000
factor(WeekdayNum)2 6324.780 93.121 67.920 0.000
factor(WeekdayNum)3 6360.136 93.097 68.317 0.000
factor(WeekdayNum)4 6279.360 93.250 67.339 0.000
factor(WeekdayNum)5 5464.630 93.218 58.622 0.000
factor(WeekdayNum)6 1169.326 93.079 12.563 0.000
I(DSN^2) 0.660 0.041 16.114 0.000
factor(Month)2 4825.442 455.153 10.602 0.000
factor(Month)3 5306.641 496.149 10.696 0.000
factor(Month)11 3286.746 266.867 12.316 0.000
factor(Month)12 9149.686 289.650 31.589 0.000
factor(Year)1992 -34.595 226.488 -0.153 0.879
factor(Year)1993 -26.648 226.925 -0.117 0.907
factor(Year)1994 453.790 227.098 1.998 0.046
factor(Year)1995 1502.790 226.743 6.628 0.000
factor(Year)1996 2365.578 227.285 10.408 0.000
factor(Year)1997 2804.758 227.065 12.352 0.000
factor(Year)1998 3134.583 227.123 13.801 0.000
factor(Year)1999 3779.208 226.878 16.657 0.000
factor(Year)2000 4435.593 226.786 19.558 0.000
factor(Year)2001 5206.297 226.702 22.965 0.000
factor(Year)2002 5574.093 227.401 24.512 0.000
factor(Year)2003 6347.567 226.968 27.967 0.000
factor(Year)2004 6576.452 226.678 29.012 0.000
factor(Year)2005 5704.218 226.901 25.140 0.000
factor(Year)2006 5376.686 226.954 23.691 0.000
factor(Year)2007 4761.861 227.234 20.956 0.000
factor(Year)2008 4100.558 226.620 18.094 0.000
factor(Year)2009 2600.179 226.727 11.468 0.000
factor(Year)2010 2864.807 228.077 12.561 0.000
factor(Year)2011 2258.601 227.157 9.943 0.000
factor(Year)2012 1629.380 226.662 7.189 0.000
factor(Year)2013 1119.108 227.056 4.929 0.000
factor(Year)2014 488.905 227.196 2.152 0.031
I(DSN^2):factor(Month)2 -0.708 0.053 -13.357 0.000
I(DSN^2):factor(Month)3 -0.728 0.047 -15.418 0.000
I(DSN^2):factor(Month)11 0.502 0.225 2.231 0.026
I(DSN^2):factor(Month)12 -3.808 0.079 -48.028 0.000
>>>>>>> origin/main

Evaluation of Model

T

Cross Validation:

Methodology

To assess model generalizability, we analysed our model using expanding window cross-validation. However, our model includes year as a factor which would be updated by NESO each year based on their subjectively estimated year effect. To mimic this process we altered our cross validation method as follows:

Firstly we fit the model on an initial training set from 1991 up to 2000. Then any the year effect is removed from the trained model. Using this trained model, we compute the ‘yearless predictions’ for 2001. To mimic adding back in a year effect (as NESO would) we assumed a known year effect, using the coefficient estimated from the full dataset. This was added to the yearless predictions’ to compute our final predictions. We then compare these with the observed data for 2001 and obtain the following metrics:

  • Squared error: \(\text{SE} = (y - \hat{y}_F)^2\)

  • Interval score: \(\text{IS}(\alpha) = U_F - L_F + \frac{2}{\alpha} (L_F - y) \mathcal{1}\{y \leq L_F\} + \frac{2}{\alpha} (y - U_F) \mathcal{1}\{y > L_F\}\) where \(L_F\) and \(U_F\) denote the lower and upper bound of the prediction interval with coverage probability \(\alpha\).

  • Dawid-Sebastiani: \(\text{DS} = \frac{(y - \hat{y}_F)^2}{\sigma^2_F} + \log(\sigma^2_F)\)

We then repeat this process, expanding the training set by one year and moving the test set forward one year.

There are drawbacks to this method: for example by assuming the known year effect from the full model, we leak information about the test year into the predictions. However we still choose this method for two reasons: Firstly, it still minimises the disrupt to the chronilogical strucutre of the data. Secondly, expanding the training set at each step, incorporates more historical data helping the model capture long-term, recurring seasonal and weekly trends more effectively.

Results

The observed vs predicted values from our cross validation are plotted in Figure 2. The points lie scattered around the \(y=x\) line indicating accuracy in predictions. It is important to note that there are some points that stray above the \(y=x\) line. However these occur at lower demand levels. Since NESO are particularly interested in accuracy at higher demand levels where shortfalls are likely to occur, this indicates that the model still remains well-suited for accurately estimating average daily demand, especially at high levels.

Using cross validation we find the following predictive scores.

Predictive Scores for Final Model
Model RMSE Mean Dawid Sebastiani Score Mean Interval Score
Final 1462.544 15.58537 9085.693

To interpret these results its useful to compare it to a baseline model \(M_1\) and a more complex model \(M_2\):

Predictive Scores Comparison by Model
Model RMSE Mean Dawid Sebastiani Score Mean Interval Score
\(M_1\) 1872.626 16.09661 12492.79
\(M_2\) 1872.626 16.09661 12492.79

Observe that each score for the final model is lower compared to \(M_1\). This suggests higher accuracy in predictions as well as confidence associated with these predictions. On the other hand observe that the scores for the model \(M_2\) with several interactions are marginally smaller than our final model. This indicates that simplifying the model the interest of practicality is justified.

To assess whether our final model is overfitting, we compare the RMSE based on the full set of data with the RMSE derived from the Cross Validation:

How the model performs on trained data vs new data
Fitted RMSE Cross Validation RMSE
1471.741 1462.544

The fitted RMSE is slightly lower than the cross-validation RMSE, likely due to smaller estimation and prediction sets in cross-validation. However, the difference is small relative to the data scale and not significant, suggesting the model does not perform significantly better on the data it was trained on compared to new data. <<<<<<< HEAD Therefore we conclude the chosen model is not overfitting.

Comparing Models

We next use cross-validation to compare the basic model with our chosen model. For each model, we plot predicted demand against observed demand, with color indicating the fold the data belongs to. Ideally, points should align with the black line \(y=x\), indicating accurate predictions.

For the basic model (left), the plot shows a wide dispersion around \(y=x\), forming a broad parallelogram shape. We see significant overestimation at lower observed demand levels and underestimation at higher levels. This indicates the model lacks complexity to capture key patterns.

In contrast, the chosen model (right) aligns more closely with \(y=x\). While we still see some overestimation and underestimation, the predictions are generally more accurate. However, there is a scatter of points above the line which are significant overestimations. This is a limitation of the model. However in general these are scattered and more occasional. On the other hand the model avoids large underestimation and generally maintains consistency around the \(y = x\) line. This highlights that the chosen model better captures demand patterns and generalizes more effectively.

The predictive scores, printed in the table below, further indicate that the chosen model forecasts peak demand levels more accurately.

Firstly, the RMSE for the basic model is \(3702.12\) while for the final model is \(2462.88\). These both seem quite large but we can consider this in terms of the scale of the data. For example, considering it as a percentage relative to the mean peak electricity demand, \(49294.32\), we find the RMSE for the basic model is \(\frac{3702.12}{49294.32} * 100 = 7.51\%\) and the RMSE for the final model is \(\frac{2462.88}{49294.32} * 100 = 4.99\%\). Furthermore, the Dawid Sebastiani score and the interval score are both lower for the final model compared to the basic model. These results indicates that the final model has a stronger performance than the basic model.

Analysis of Prediction Performance by Month

To measure the prediction accuracy of the final model across all of the months, the following 3 prediction scores:

  • Squared error score
  • Dawid Sebastiani score
  • Interval score

These were computed using the 24-fold cross validation scheme as defined earlier. The average of each of these scores, grouped by month, are displayed in the graphs below.

It is clear from all three scores that the prediction accuracy of the model is not consistent across all months. The model makes the least accurate predictions for the month of December, as highlighted by the fact that the prediction score for this month is the highest across all three scores. This may be due to more variation in energy usage in December, due to disruption to standard working hours in the holiday/festive season. Energy usage may increase for example due to Christmas lights.

The model makes the most accurate predictions for the month of February, as highlighted by the fact that the prediction score for this month is the lowest across all three scores.

Analysis of Prediction Performance by Day Type

We next investigate how well the final model predicts for weekdays vs weekends. To do this we first plot the predicted values against observed demand for the weekend and weekdays individually.

Observe that for the weekend (right) the points mostly align closely with \(y=x\) with a slight underestimation as demand increases. On the other hand for weekdays (left), while most points align closely with \(y=x\) for the high demand levels, as demand decreases the values are more scattered. In particular, there is slight underestimation of demand and some significant overestimations of demand. This trend makes sense, as we saw during data exploration that weekdays tend to have higher demand levels. As a result, low demand levels are relatively uncommon, making them harder to predict accurately. Ideally, however, the model would be able to account for weekdays where the demand was uncommonly low or weekend where demand was uncommonly high. This limitation of the model is discussed further in the limitations section.

Model Limitations

Despite the strengths of our final model, there are limitations that must be taken into account when interpreting our results.

  • Firstly, it is clear that our model overestimates the peak electricity demand in the lower tails, when the peak demand is relatively low. However, NESO has expressed that overestimating is less of a priority for them than failing to meet high peak demand, making this less of a concern.

  • As mentioned above, if the actual demand is low on a weekday, the model is more likely to predict demand inaccurately. While this makes sense, since in general the demand is higher on weekdays, ideally the model would be able to account for this. A potential way of adapting the model to account for this could be considering day-type as a factor rather than weekday index.

  • As previously mentioned, the residuals of our model are not normally distributed. However this is the least important assumption as we can appeal to the Central Limit Theorem as the size of the sample grows.

Many of the variables are highly correlated with each other (e.g. TO vs TE). This is a drawback of the data we have been given.

======= Therefore we conclude the final model is not overfitting.

>>>>>>> origin/main